Goto

Collaborating Authors

 key-value pair


Augmenting Language Models with Long-Term Memory

Neural Information Processing Systems

Existing large language models (LLMs) can only afford fix-sized inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, Language Models Augmented with Long-Term Memory (LONGMEM), which enables LLMs to memorize long history. We design a novel decoupled network architecture with the original backbone LLM frozen as a memory encoder and an adaptive residual side-network as a memory retriever and reader. Such a decoupled memory design can easily cache and update long-term past contexts for memory retrieval without suffering from memory staleness. Enhanced with memory-augmented adaptation training, LONGMEM can thus memorize long past context and use long-term memory for language modeling.




Benefits and Limitations of Communication in Multi-Agent Reasoning

arXiv.org Artificial Intelligence

Chain-of-thought prompting has popularized step-by-step reasoning in large language models, yet model performance still degrades as problem complexity and context length grow. By decomposing difficult tasks with long contexts into shorter, manageable ones, recent multi-agent paradigms offer a promising near-term solution to this problem. However, the fundamental capacities of such systems are poorly understood. In this work, we propose a theoretical framework to analyze the expressivity of multi-agent systems. We apply our framework to three algorithmic families: state tracking, recall, and $k$-hop reasoning. We derive bounds on (i) the number of agents required to solve the task exactly, (ii) the quantity and structure of inter-agent communication, and (iii) the achievable speedups as problem size and context scale. Our results identify regimes where communication is provably beneficial, delineate tradeoffs between agent count and bandwidth, and expose intrinsic limitations when either resource is constrained. We complement our theoretical analysis with a set of experiments on pretrained LLMs using controlled synthetic benchmarks. Empirical outcomes confirm the tradeoffs between key quantities predicted by our theory. Collectively, our analysis offers principled guidance for designing scalable multi-agent reasoning systems.


Revisiting associative recall in modern recurrent models

arXiv.org Artificial Intelligence

Modern recurrent deep learning models - such as state-space models (SSMs) - have emerged as a promising computationally efficient alternative to Transformers for sequence modeling. However, how their practical differences in learn-ability and optimization impact core capabilities remains underexplored. In this paper, we thoroughly compare SSM and Transformer learning dynamics on two fundamental benchmarks highly correlated with language modeling performance: associative recall and copying. We find that, while Transformers are robust to optimization hyperparameters, the performance of modern recurrent models suffers from critical instabilities: success is confined to an extremely narrow window of learning rates, outside of which accuracy drastically drops. This issue can confound performance evaluations and expressivity conclusions, revealing a fundamental mismatch in the loss landscape of modern recurrent models compared to Transformers. We demonstrate that this brittle optimization has a direct impact on scaling, causing SSMs to favor width over depth. Indeed, we also find that, while the 1-layer Transformer's performance on recall does not exceed random guessing, well-tuned Mamba and other SSMs can learn to recall with one layer, yet with dynamics that do not resemble the formation of induction heads. Taken together, our findings suggest that a crucial differentiator between these architectures lies not just in their expressivity but in their fundamental learnability properties, pointing to optimization stability as a key challenge for the future of SSMs. Since early developments (Rumelhart et al., 1986; Elman, 1990), RNNs have driven progress in machine learning techniques for sequential data, with milestones such as Echo-State Networks (Jaeger, 2001) LSTM (Hochreiter & Schmidhuber, 1997) and GRU (Cho et al., 2014). However, two problems severely limit the application of RNNs in modern times: first, GPU architectures struggle with sequential processing. Secondly, it is widely known that RNNs are hard to train due to vanishing and exploding gradients issues (Bengio et al., 1994; Hochreiter et al., 2001; Pascanu et al., 2013). These challenges have led to the introduction of a different paradigm: the Attention mechanism, implemented around the Transformer architecture (V aswani et al., 2017). Instead of processing inputs sequentially while building up internal memory (RNNs), Attention computes pairwise interactions between data points, allowing for modeling direct links between elements in a sequence and thus mitigating vanishing gradients.


Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression

arXiv.org Artificial Intelligence

Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight-even at greater computational cost-has been relatively underexplored. In this work, we bridge this gap by proposing Local Linear Attention (LLA), a novel attention mechanism derived from nonparametric statistics through the lens of test-time regression. First, we show that LLA offers theoretical advantages over Linear and Softmax Attention for associative memory via a bias-variance trade-off analysis. Next, we address its computational challenges and propose two memory-efficient primitives to tackle the $ฮ˜(n^2 d)$ and $ฮ˜(n d^2)$ complexity. We then introduce FlashLLA, a hardware-efficient, blockwise algorithm that enables scalable and parallel computation on modern accelerators. In addition, we implement and profile a customized inference kernel that significantly reduces memory overheads. Finally, we empirically validate the advantages and limitations of LLA on test-time regression, in-context regression, associative recall and state tracking tasks. Experiment results demonstrate that LLA effectively adapts to non-stationarity, outperforming strong baselines in test-time training and in-context learning, and exhibiting promising evidence for its scalability and applicability in large-scale models. Code is available at https://github.com/Yifei-Zuo/Flash-LLA.


The Media Bias Detector: A Framework for Annotating and Analyzing the News at Scale

arXiv.org Artificial Intelligence

Mainstream news organizations shape public perception not only directly through the articles they publish but also through the choices they make about which topics to cover (or ignore) and how to frame the issues they do decide to cover. However, measuring these subtle forms of media bias at scale remains a challenge. Here, we introduce a large, ongoing (from January 1, 2024 to present), near real-time dataset and computational framework developed to enable systematic study of selection and framing bias in news coverage. Our pipeline integrates large language models (LLMs) with scalable, near-real-time news scraping to extract structured annotations -- including political lean, tone, topics, article type, and major events -- across hundreds of articles per day. We quantify these dimensions of coverage at multiple levels -- the sentence level, the article level, and the publisher level -- expanding the ways in which researchers can analyze media bias in the modern news landscape. In addition to a curated dataset, we also release an interactive web platform for convenient exploration of these data. Together, these contributions establish a reusable methodology for studying media bias at scale, providing empirical resources for future research. Leveraging the breadth of the corpus over time and across publishers, we also present some examples (focused on the 150,000+ articles examined in 2024) that illustrate how this novel data set can reveal insightful patterns in news coverage and bias, supporting academic research and real-world efforts to improve media accountability.


f6876a9f998f6472cc26708e27444456-AuthorFeedback.pdf

Neural Information Processing Systems

We thank all reviewers for their thoughtful comments. "The method is only compared to prior models with long-term memory on the [QA] task, and doesn't perform as " This is expected as these are ML models with non-biological Our goal was to show that simple local Hebbian plasticity can be utilized to solve many of these tasks. "Is it essential that the key-value Our goal was to show that simple local plasticity is sufficient for many tasks. "How and why do the query and storage keys "[...] isn't it possible to achieve good performance on the tasks in the paper This approach is rather close to the approach of MemN2N. "[...] it would be helpful to explain the practical or physiological relevance in more detail.


Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length

arXiv.org Artificial Intelligence

Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models' ability to suppress irrelevant content during retrieval.


Recursive Question Understanding for Complex Question Answering over Heterogeneous Personal Data

arXiv.org Artificial Intelligence

Question answering over mixed sources, like text and tables, has been advanced by verbalizing all contents and encoding it with a language model. A prominent case of such heterogeneous data is personal information: user devices log vast amounts of data every day, such as calendar entries, workout statistics, shopping records, streaming history, and more. Information needs range from simple look-ups to queries of analytical nature. The challenge is to provide humans with convenient access with small footprint, so that all personal data stays on the user devices. We present ReQAP, a novel method that creates an executable operator tree for a given question, via recursive decomposition. Operators are designed to enable seamless integration of structured and unstructured sources, and the execution of the operator tree yields a traceable answer. We further release the PerQA benchmark, with persona-based data and questions, covering a diverse spectrum of realistic user needs.